
    Three-Way Joins on MapReduce: An Experimental Study

    We study three-way joins on MapReduce. Joins are very useful in a multitude of applications, from data integration and traversing social networks to mining graphs and automata-based constructions. However, joins are expensive, even for moderate data sets; we need efficient algorithms to perform distributed computation of joins using clusters of many machines. MapReduce has become an increasingly popular distributed computing system and programming paradigm. We consider a state-of-the-art MapReduce multi-way join algorithm by Afrati and Ullman and show when it is appropriate for use on very large data sets. By providing a detailed experimental study, we demonstrate that this algorithm scales much better than what is suggested by the original paper. However, if the join result needs to be summarized or aggregated, as opposed to being only enumerated, then the aggregation step can be integrated into a cascade of two-way joins. This makes the cascade more efficient than the multi-way join algorithm, and thus the preferred solution. Comment: 6 pages
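As a rough illustration of why aggregation can be folded into a cascade of two-way joins, the sketch below counts the tuples of R(a,b) ⋈ S(b,c) ⋈ T(c,d) without enumerating them. It is a single-machine Python sketch, not the MapReduce implementation evaluated in the paper; the relation layout and the function name `count_three_way_join` are assumptions made for this example.

```python
from collections import Counter

def count_three_way_join(R, S, T):
    """Count tuples of R(a,b) JOIN S(b,c) JOIN T(c,d) via a cascade of
    two-way joins, aggregating after each step instead of enumerating."""
    # Step 1: aggregate R on its join attribute b.
    r_by_b = Counter(b for _, b in R)

    # Step 2: join the aggregate with S on b and re-aggregate on c.
    # Only one counter per distinct c value is carried forward.
    rs_by_c = Counter()
    for b, c in S:
        if b in r_by_b:
            rs_by_c[c] += r_by_b[b]

    # Step 3: join with T on c and sum the partial counts.
    return sum(rs_by_c.get(c, 0) for c, _ in T)

def count_by_enumeration(R, S, T):
    """Reference version: enumerate every joined tuple, then count."""
    return sum(1 for (_, b1) in R for (b2, c1) in S for (c2, _) in T
               if b1 == b2 and c1 == c2)

R = [(1, 2), (3, 2)]
S = [(2, 5), (2, 6)]
T = [(5, 9), (5, 8), (6, 7)]
print(count_three_way_join(R, S, T))
```

The intermediate results here are counters keyed by a join attribute, so their size is bounded by the number of distinct join values rather than by the size of the full join.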

    Clearing Contamination in Large Networks

    In this work, we study the problem of clearing contamination spreading through a large network, which we model as a graph searching game. The problem can be summarized as constructing a search strategy that leaves the graph clear of any contamination at the end of the searching process in as few steps as possible. We show that this problem is NP-hard even on directed acyclic graphs and provide an efficient approximation algorithm. We experimentally compare the performance of our approximation algorithm against the lower bound on several large online networks, including Slashdot, Epinions and Twitter. The experiments reveal that in most cases our algorithm performs near optimally.

    Algebraic rewritings for optimizing regular path queries

    Rewriting queries using views is a powerful technique that has applications in query optimization, data integration, data warehousing, etc. Query rewriting in relational databases is by now rather well investigated. However, in the framework of semistructured data the problem of rewriting has received much less attention. In this paper we focus on extracting as much information as possible from algebraic rewritings for the purpose of optimizing regular path queries. The cases when we can find a complete exact rewriting of a query using a set of views are very “ideal”. However, there is always information available in the views, even if this information is only partial. We introduce “lower” and “possibility” partial rewritings and provide algorithms for computing them. These rewritings are algebraic in their nature, i.e. we use only the algebraic view definitions for computing the rewritings. We do not use any pairs (tuples) of objects for computing the rewritings. This fact makes them a main memory product, which can be used for reducing secondary memory and remote access. After the main memory algebraic computation of the rewritings there is a second phase, with secondary memory access, for deriving the pairs of objects in the query answer. We give two algorithms for utilizing the partial lower and partial possibility rewritings to decrease the number of secondary memory accesses.

    Query processing using views in semistructured databases

    Since its introduction, XML, the eXtensible Markup Language, has quickly emerged as the universal format for publishing and exchanging data on the World Wide Web. As a result, data sources, including object-relational databases, are now faced with a new class of users: clients and customers who would like to deal directly with XML data rather than being forced to deal with the data source's particular schema and query languages. XML is also rapidly becoming popular for representing web data, as it brings a fine-grained structure to web information and exposes the semantics of web content. In all these web applications, including electronic commerce and intelligent agents, view mechanisms are recognized as critical and are being widely employed to represent users' specific interests. Rewriting user queries using views is a powerful technique in the applications described above, which can be categorized as data integration, data warehousing and query optimization. In this study we identify some difficulties with currently known methods for using rewritings in XML-like "semistructured" databases. We study the problem in two realistic scenarios. The first one is related to information integration systems such as the Information Manifold, in which the data sources are modelled as sound views over a global schema. The second scenario is query optimization using cached views. In this setting we propose two kinds of algebraic rewritings that focus on extracting as much information as possible from the views for the purpose of optimizing regular path queries, which are the building blocks of all the query languages for semistructured data.

    Query optimization using views in semistructured databases

    A wealth of query languages for semistructured graph data has emerged, and what almost all of these languages have in common is a graph navigation mechanism, which is not usually found in their relational predecessors. However, the navigation is very expensive since it typically involves many database or network accesses. As a consequence, optimizing the navigational part of the queries is essential for having commercially attractive query processors similar to those for relational data. One of the best-known ways of optimizing a query in general is to use available information in precomputed or materialized views. At the heart of our approaches is the concept of query rewriting using views. Query rewriting is a well-known problem, which has been solved and deeply investigated for (non-recursive) queries over relational data. However, for navigational queries, which are a subset of the bigger family of recursive datalog queries, the problem of rewriting is more complex and challenging. In this thesis we study the problem in two realistic scenarios. The first one is related to information integration systems such as the Information Manifold, in which the data sources are modeled as sound views over a global schema. In such cases the "real" database is not available and what we try to compute for a query is its "certain answer according to the views." The second scenario is query optimization using views when the real database is available, but the views are cheaper to access. In such a case we can use the database to answer parts of the query for which there are no relevant views. We propose algebraic rewritings that focus on extracting as much information as possible from the views for optimizing navigational queries.
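To make the notion of a navigational (regular path) query concrete, here is a minimal sketch of how such a query might be evaluated directly over an edge-labeled graph, by searching the product of the graph and an automaton for the path expression. It does not implement the rewriting algorithms described in the abstract; the function name `eval_rpq`, the hand-built automaton encoding, and the toy graph are all assumptions made for this example.

```python
from collections import deque

def eval_rpq(edges, nfa, start_state, accept_states, source):
    """Return the nodes reachable from `source` along a path whose label
    sequence is accepted by the given NFA (no epsilon transitions).

    edges: iterable of (node, label, node) triples
    nfa:   dict mapping (state, label) -> set of next states
    """
    # Adjacency list: node -> [(label, target), ...]
    adj = {}
    for u, label, v in edges:
        adj.setdefault(u, []).append((label, v))

    # Breadth-first search over (graph node, automaton state) pairs.
    seen = {(source, start_state)}
    queue = deque(seen)
    answers = set()
    while queue:
        node, state = queue.popleft()
        if state in accept_states:
            answers.add(node)
        # Each edge followed here stands for one database or network access.
        for label, target in adj.get(node, []):
            for next_state in nfa.get((state, label), ()):
                pair = (target, next_state)
                if pair not in seen:
                    seen.add(pair)
                    queue.append(pair)
    return answers

# Toy graph and an automaton for the path expression  a b*
edges = [("x", "a", "y"), ("y", "b", "z"), ("z", "b", "w"), ("x", "b", "q")]
nfa = {(0, "a"): {1}, (1, "b"): {1}}
print(sorted(eval_rpq(edges, nfa, 0, {1}, "x")))
```

In this sketch every edge traversal touches the graph, which is the cost that answering parts of the query from precomputed views is meant to reduce.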